Piecing the puzzle - Self-publishing Queryable Research Data on the Web

Author

  • Ruben Verborgh
Abstract

Publishing research on the Web accompanied by machine-readable data is one of the aims of Linked Research. Merely embedding metadata as RDFa in HTML research articles, however, does not solve the problems of accessing and querying that data. Hence, I created a simple ETL pipeline to extract and enrich Linked Data from my personal website, publishing the result in a queryable way through Triple Pattern Fragments. The pipeline is open source, uses existing ontologies, and can be adapted to other websites. In this article, I discuss this pipeline, the resulting data, and its possibilities for query evaluation on the Web. More than 35,000 RDF triples of my data are queryable, even with federated SPARQL queries because of links to external datasets. This proves that researchers do not need to depend on centralized repositories for readily accessible (meta-)data, but instead can, and should, take matters into their own hands.

Introduction

The World Wide Web continues to shape many domains, not least research. On the one hand, the Web beautifully fulfills its role as a distribution channel of scientific knowledge, for which it was originally invented. This spurs interesting dialogues concerning Open Access [1] and even piracy [2] of research articles. On the other hand, the advent of social networking creates new interaction opportunities for researchers, but also forces us to consider our online presence [3]. Various social networks dedicated to research have emerged, such as Mendeley, ResearchGate, and Academia. They attract millions of researchers, and employ various tactics to keep us there.

A major issue of these social research networks is their lack of mutual complementarity. None of them has become a clear winner in terms of adoption. At first sight, the resulting plurality seems a blessing for diversity, compared to the monoculture of Facebook for social networking in general. Yet whereas other generic social networks such as Twitter and LinkedIn serve complementary professional purposes compared to Facebook, social research networks share nearly identical goals. As an example, a researcher could announce a newly accepted paper on Twitter, discuss its review process on Facebook, and share a photograph of an award on LinkedIn. In contrast, one would typically not list one publication exclusively on Mendeley and another exclusively on Academia, as neither publication list would then be complete. In practice, this results in constant bookkeeping for researchers who want each of their profiles to correctly represent them, which is a necessity if such profiles are implicitly or explicitly treated as performance indicators [4].

Deliberate absence on any of these networks is not a viable option, as parts of one's publication metadata might be automatically harvested or entered by co-authors, leaving an automatically generated but incomplete profile. Furthermore, the quality of such non-curated metadata records can be questionable. As a result, researchers who do not actively maintain their online research profiles risk ending up with incomplete and inaccurate publication lists on those networks. Such misrepresentation can be significantly worse than not being present at all, but given the public nature of publication metadata, complete absence is not an enforceable choice.

Online representation is not limited to social networks: scientific publishers also make metadata available about their journals and books. For instance, Springer Nature recently released SciGraph, a Linked Open Data platform that includes scholarly metadata. Accuracy is less of an issue in such cases, as data comes directly from the source. However, quality and usability are still influenced by the way data is modeled and whether or how identifiers are disambiguated. Completeness is not guaranteed, given that authors typically target multiple publishers. Therefore, even such authoritative sources do not provide individual researchers with a correct profile.

In the spirit of decentralized social networking [5] and Linked Data [6], several researchers instead started publishing their own data and metadata. I am one of them, since I believe in practicing what we preach [7] as Linked Data advocates, and because I want my own website to act as the main authority for my data. After all, I can spend more effort on the completeness and accuracy of my publication metadata than most other platforms could reasonably do for me. In general, self-published data typically resides in separate RDF documents [8] (for which the FOAF vocabulary [9] is particularly popular [10]), or inside of HTML documents (using RDFa Lite [11] or similar formats).

Despite the controllable quality of personally maintained research data and metadata in individual documents on the Web, they are not as visible, findable, and queryable as those of social research networks. I call a dataset interface "queryable" with respect to a given query when a consumer does not need to download the entire dataset in order to evaluate that query over it with full completeness. Unfortunately, hosting advanced search interfaces on a personal website quickly becomes complex and expensive. To mitigate this, I have implemented a simple Extract/Transform/Load (ETL) pipeline on top of my personal website, which extracts, enriches, and publishes my Linked Data in a queryable way through a Triple Pattern Fragments [12] interface. The resulting data can be browsed and queried live on the Web, with higher quality and flexibility than on my other online profiles, and at only a limited cost for me as data publisher.

This article describes my use case, which resembles that of many other researchers. I detail the design and implementation of the ETL pipeline, and report on its results. At the end, I list open questions regarding self-publication, before concluding with a reflection on the opportunities for the broader research community.
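
To make the notion of a "queryable" interface more concrete, the following is a minimal sketch of the kind of request a triple-pattern interface answers: the consumer supplies a pattern in which subject, predicate, and object may each be fixed or left open, and receives one page of matching triples together with a count of all matches. This is an illustration only, not the pipeline or server described in the article. The `fragment` function, the `Triple` type, the page size, and the `ex:` identifiers are all hypothetical, and the in-memory list stands in for whatever storage a real interface would use.

```python
from typing import List, NamedTuple, Optional, Tuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    object: str


# Hypothetical sample data; the ex: identifiers are placeholders,
# not taken from the author's actual dataset.
TRIPLES = [
    Triple("ex:me", "foaf:name", '"A. Researcher"'),
    Triple("ex:me", "foaf:made", "ex:article1"),
    Triple("ex:me", "foaf:made", "ex:article2"),
    Triple("ex:article1", "dc:title", '"Some title"'),
]


def fragment(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
    page: int = 1,
    page_size: int = 2,
) -> Tuple[List[Triple], int]:
    """Return one page of triples matching the pattern and the total match count.

    A value of None acts as a wildcard for that position.
    """
    matches = [
        t
        for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.object == obj)
    ]
    start = (page - 1) * page_size
    return matches[start:start + page_size], len(matches)


if __name__ == "__main__":
    # Ask for everything known about ex:me, one page at a time.
    first_page, total = fragment(TRIPLES, subject="ex:me")
    print(total, first_page)
```

Because each request only selects by a single pattern and returns a bounded page, the consumer never has to download the whole dataset, while more complex queries can be assembled on the client side from several such requests.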

Publication date: 2017